A. Veispak et al. / Speech Communication 77 (2016) 1–7

using either monosyllables, sentences or another type of auditory stimuli, have been developed to evaluate speech perception skills in children (for review see Mendel, 2008). Intelligibility tests based on sentences are thought to comprise utterances that are closer to the natural process of speech communication, reveal better phoneme representativeness, and can yield more accurate intelligibility data (Nilsson et al., 1994; Ozimek et al., 2012; Ozimek et al., 2010; Plomp and Mimpen, 1979b; van Wieringen and Wouters, 2008). On the other hand, in order for a speech recognition task to provide useful information about the auditory capacity of children, the test should seek to minimize the influence of cognitive and linguistic demands (McCreery et al., 2010; Nittrouer and Boothroyd, 1990). However, each test, using a different type of stimuli, contributes one piece of the puzzle toward an assessment of a child's overall speech perception; hence, at a minimum, the test battery should include an assessment of both word and sentence recognition (Mendel, 2008). Given that there are no tests available in the Estonian language for assessing children's speech perception abilities, as a first piece of the puzzle we have developed an Estonian words-in-noise test for 7- to 9-year-old children (EWINc). The test consists of 14 lists of monosyllables, where scoring is based on correctly identified phonemes. In addition to being less influenced by the listener's vocabulary knowledge than whole-word scores (Boothroyd, 1968), phoneme scores increase the number of data points, decrease variability and improve the precision of interpreting small differences in performance across presentations (Gelfand, 2003). Scoring based on phonemes has also been demonstrated to decrease the inter-subject variability and the differences in performance between age groups (Nittrouer and Boothroyd, 1990). McCreery et al.
(2010), for example, compared the performance of children in four different age groups (5–6, 7–8, 9–10 and 11–12 year olds) using the Computer Aided Speech Perception Assessment (CASPA; Boothroyd, 2006) and demonstrated that the differences in word scores between the 7–8, 9–10 and 11–12 year olds were not significant, but all three groups differed significantly from the 5–6 year olds. However, when pairwise comparisons were made for phoneme scores across age, it emerged that only the 5–6 and 11–12 year olds differed significantly from each other. Several studies have reported that the major age effect on speech recognition is observed in children less than 7 years of age (Hnath-Chisolm et al., 1998; Siegenthaler, 1969, 1975; Wilson et al., 2010). On the other hand, children's speech recognition performance is considered to approach that of adults by around 10 to 12 years of age (Elliot, 1979; Elliot et al., 1979; Stuart, 2005). The age bracket (7- to 9-year-old children) chosen for the EWINc test therefore seems well justified by previously conducted research. This age is very important not only for speech and language acquisition, but also for general development, as it coincides with the first two years of primary school in Estonia, a period of intense scholastic learning. The EWINc test, a relevant addition to the Estonian words-in-noise (EWIN) test for adults (Veispak et al., 2015), was developed based on the example of the Nederlandse Vereniging voor Audiologie (NVA) lists (Bosman, 1989; Wouters et al., 1994). In the first phase of the study the monosyllables were selected and perceptually optimized, and the 14 lists were created. In the second phase of the study the lists were evaluated in normal hearing (NH) children, both in noise and in quiet. In general, for a good speech recognition test for children, the basic material should consist of words that are part of the known and used vocabulary of the children at the respective age.
The words are combined into lists, which should have equal difficulty quantified by the speech reception threshold (SRT), the slope of the performance metric at this SRT, and the precision quantified via test–retest comparisons. All of the abovementioned aspects of what constitutes a good quality speech test for children have been taken into account, analyzed and quantified in the current study.

2. Methods

2.1. Materials

In total, 350 simple and well-known monosyllables, which should be part of the vocabulary of children above 6 years of age, were selected from extensively used first- and second-grade primary school Estonian language textbooks (Tungal and Hiiepuu, 2007; Jundas et al., 2005). The words were recorded in isolation one by one (each word 10 times) by a native Estonian speaker (female). No carrier phrase was used. Recordings of the words were made in a double-walled sound-proof booth at the Experimental Oto-rhino-laryngology (ExpORL) Research Group, Department of Neurosciences, University of Leuven (KU Leuven). Recordings were made with an Edirol R-4 PRO recorder and sampled at 44 100 Hz (24-bit resolution), using a Sennheiser HS2 headset microphone. The best token of each word was selected based on the audibility and clarity of each individual phoneme and edited in Cool Edit (Cool Edit Pro 2002, version 1.2a, Syntrillium Software Corporation, Phoenix, AZ). Words were cut at zero crossings and scaled to their average root-mean-square (RMS) level before the first perceptual evaluation. All the words were stored as '.wav' files on the hard disk of a computer. Based on the spectrum of the recorded words, stationary speech-shaped noise was generated. The average RMS level of the noise was rescaled to the average RMS of the words. A more detailed description of the selection of the stimulus words, recording conditions, editing, speech rate and noise creation can be found in Veispak et al. (2015).

2.2.
Subjects, test set-up, and calibration

Three groups of children participated in the evaluations. The participants of Group 1 (n = 11; average age 7.0, range 6.2–8.7 years), Group 2 (n = 20; average age 8.1, range 7.0–8.5 years), and Group 3 (n = 12; average age 8.6, range 7.2–9.9 years) were primary school students. All NH children had pure-tone thresholds below 20 dB HL at all octave frequencies between 250 and 4000 Hz (ISO 389-8, 2004) in both ears and were native speakers of Estonian. The children in Group 1 were tested in Belgium, whereas the children in Groups 2 and 3 were tested in Estonia. The majority of the participating children were multilingual: official training in a second language starts rather early, and acquiring another language is difficult to avoid given the multilingual pop culture and media landscape in Estonia. The participant was seated in a quiet room (for testing words-in-noise) or in a double-walled sound-treated audiometric suite (for testing words-in-quiet) and heard the words monaurally (right ear) through Sennheiser HDA200 earphones. The words were played directly from a Dell Latitude E6430 portable computer using the software interface APEX 3 (Francart et al., 2008) and passed through an external RME Hammerfall DSP sound card to control the level of presentation. The levels of the words and noise were calibrated with a Type 4153 artificial ear connected to a Brüel & Kjær type 2250 sound level meter. The participants were instructed to listen and repeat aloud whatever was heard. No feedback was provided. The number of correctly identified phonemes was scored manually by one experimenter on the spot. This study was approved by the Medical Ethics Committee of the University of Leuven (KU Leuven), University Hospitals Leuven, as well as the Medical Ethics Committee of Tallinn (National Institute for Health Development in Estonia); written informed consent was obtained from the parents of all participants.

2.3. Experiment 1

The aim of the first evaluation was to select a subset of words that are equally intelligible under the same adverse conditions. The 350 monosyllables were distributed into 25 lists, 14 items in each. The items were distributed with the intention of avoiding category clusters (e.g. animals, body parts, etc.) and keeping the representation of different phonemes in each list as equal as possible. The 25 lists were randomly divided into 5 blocks, each presented at 5 different signal-to-noise ratios (SNR; 1, −2, −5, −8, −11 dB) to Group 1.
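The stimulus preparation described above (words equalized to a common RMS, stationary noise generated from the spectrum of the word material, and a fixed noise level with the speech scaled to the target SNR) can be sketched as follows. This is only an illustration: the function names are ours, and the phase-randomization construction of the noise is an assumption, since the original processing was done in Cool Edit and the paper does not state the exact method.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def scale_to_rms(x, target_rms):
    """Rescale a signal to a target RMS level."""
    return x * (target_rms / rms(x))

def speech_shaped_noise(words, n_samples, seed=0):
    """Stationary noise sharing the long-term spectrum of the word
    material: combine the magnitude spectrum of the concatenated words
    with random phase and invert (an assumed, standard construction)."""
    rng = np.random.default_rng(seed)
    magnitude = np.abs(np.fft.rfft(np.concatenate(words), n=n_samples))
    phase = rng.uniform(0.0, 2.0 * np.pi, magnitude.shape)
    noise = np.fft.irfft(magnitude * np.exp(1j * phase), n=n_samples)
    # Rescale the noise to the average RMS of the words, as in the text.
    return scale_to_rms(noise, np.mean([rms(w) for w in words]))

def mix_at_snr(word, noise, snr_db):
    """With word and noise equalized to the same RMS, scaling the word
    by 10**(SNR/20) yields the desired SNR while the noise level stays
    fixed (in the experiments, 65 dB SPL)."""
    return word * 10.0 ** (snr_db / 20.0) + noise[: len(word)]
```

Keeping the noise at a fixed playback level and varying only the speech gain means a single calibration of the noise channel covers every SNR condition.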
Every child identified 1 block of words at each SNR, starting with the block at the highest SNR and continuing with the consecutive blocks in increasing difficulty. Hence, in total, as there were 11 participants, every block was identified by 2 children at 4 SNRs and by 3 children at 1 SNR. The order of the words was fixed in each list, and the same 5 lists were grouped together into one block throughout the whole of Experiment 1. The order of the lists presented within each block was randomized for every participant. The noise was presented continuously at 65 dB SPL. The total testing time, together with a break after every block, was about an hour and a half per participant. The experiment was described to the children as a computer game with difficulty levels; children earned stickers upon completing each level. During testing, children's parents were given a list of all the stimulus words and asked to mark the words they considered unfamiliar to their children. On average, 7 words were marked as potentially unfamiliar by parents. After testing, children were asked to describe the meaning of the words noted by their parents as well as the words identified incorrectly in the easiest listening condition (SNR 1 dB). All participating children without exception were fully aware of the meaning of the words, indicating that their performance was not influenced by unfamiliarity with the items.

2.4. Construction of the lists

The slopes and SRTs at 50% correct were determined by fitting a logistic function to each of the 350 words separately. As we were aiming to include words with a similar performance curve, a large portion of the words turned out to be unsuitable for the test. Nearly one third of all of the stimulus words were excluded because they consisted of phoneme clusters which were either perfectly audible even in the most difficult listening condition (e.g.
‘kass’, ‘hunt’) or phoneme clusters which were poorly identifiable even in the easiest listening condition (e.g. ‘nälg’, ‘memm’). The first selection of the potentially includable monosyllables was based on the adults' data. However, as we were aiming to construct one set of words usable in both the adult and the children's version of EWIN, only the words which were also suitable in the group of children were selected. In total, 140 words were included based on the items' individual 50% SRTs. In the group of adults the SRTs of the 140 monosyllables ranged ±4 dB around the median SRT (Veispak et al., 2015). In the children's data the SRTs of the same words ranged from +5 to −3 dB around the median SRT. The approximate 1 dB shift in SRT values of individual words, when comparing adults and children, is compatible with the resulting SRTs when averaged over every individual subject. As the slope values were highly variable both in the group of adults (4%–16%/dB) and in the group of children (3%–20%/dB), no concrete criterion for inclusion based on slope was defined. As the monosyllables were divided between the lists based on the individual 50% SRTs of the words, with the intention of equalizing the average SRTs of all lists and distributing phonemes as equally between the lists as possible, the resulting distribution of words in EWINc (see Appendix) differs slightly from the version for adults. The frequencies of occurrence of the 504 phonemes (14 × (3 + 33)) are listed in Table 1 as percentages as well as averages per list, in descending order.

2.5. Experiment 2

For words-in-noise testing the 14 lists were randomly divided into 4 blocks (3 + 3 + 4 + 4) and presented to Group 2 of NH children (n = 20) at 4 SNRs (−2.5, −5, −7.5, and −10 dB).
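Returning briefly to the selection rule of Section 2.4 (retaining words whose individual 50% SRT lies within a window around the median SRT), the step can be sketched as below. The function name and parameterization are ours; the default window mirrors the ±4 dB criterion reported for the adult data, and the bounds are parameters because the children's data spanned an asymmetric +5/−3 dB range.

```python
import numpy as np

def select_words(word_srts, low=-4.0, high=4.0):
    """Keep words whose fitted 50%-correct SRT falls within
    [median + low, median + high] dB of the median SRT across words.

    word_srts: dict mapping each word to its individually fitted SRT (dB).
    Returns the retained words, sorted alphabetically."""
    srts = np.array(list(word_srts.values()), dtype=float)
    median = float(np.median(srts))
    return sorted(w for w, s in word_srts.items() if low <= s - median <= high)
```

Words far below the median (audible even at the hardest SNR) and far above it (unintelligible even at the easiest SNR) fall outside the window and are dropped, which is exactly the behavior described for the excluded third of the stimuli.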
As the average performance of the children in the first experiment at −11 dB SNR was lower than 30% phonemes correct, the SNRs of the second experiment were adjusted to provide detailed data for calculating the SRT at 50% while minimizing zero performance. The speech-weighted noise was presented continuously at 65 dB SPL. Hence, every block was identified by five children at each SNR. For words-in-quiet, based on a pilot experiment, the stimuli were presented at the following levels: 35, 30, 25, 20 and 15 dB SPL. The 14 lists were divided into 5 blocks (3 + 3 + 3 + 3 + 2) and presented to Group 3. Every child identified one block of words at every presentation level, in increasing difficulty. Hence, every list was identified by 3 children at 2 presentation levels and by 2 children at 3 presentation levels. The order of the words was fixed in each list and the same lists were
grouped together into one block throughout both the testing in quiet and the testing in noise. However, the order of the lists presented within each block was randomized for every participant. The total testing time was about half an hour per subject for testing both in noise and in quiet.

Table 1
Frequencies over all the lists, average frequency per list and percent frequency of occurrence of the phonemes, in descending order.

Phoneme  IPA    Total (of 504)  Per list  %
K        [k]    45              3.21      8.93
L        [l]    37              2.64      7.34
R        [r]    35              2.50      6.94
A        [ɑ]    33              2.36      6.55
P        [p]    28              2.00      5.56
N        [n]    27              1.93      5.36
I        [i]    24              1.71      4.76
T        [t]    24              1.71      4.76
E        [e]    23              1.64      4.56
S        [s]    23              1.64      4.56
O        [o]    19              1.36      3.77
V        [v]    19              1.36      3.77
U        [u]    18              1.29      3.57
H        [h]    18              1.29      3.57
Õ        [ɤ]    17              1.21      3.37
M        [m]    16              1.14      3.17
D        [d̥]    14              1.00      2.78
G        [ɡ]    14              1.00      2.78
Ä        [æ]    11              0.79      2.18
Ii       [iː]   8               0.57      1.59
Oo       [oː]   7               0.50      1.39
J        [j]    6               0.43      1.19
Uu       [uː]   5               0.36      0.99
Nn       [nn]   5               0.36      0.99
Ü        [y]    4               0.29      0.79
Ll       [ll]   4               0.29      0.79
Ee       [eː]   3               0.21      0.60
B        [b]    3               0.21      0.60
Tt       [tt]   3               0.21      0.60
Ss       [ss]   3               0.21      0.60
Aa       [ɑː]   2               0.14      0.40
Öö       [øː]   2               0.14      0.40
Mm       [mm]   2               0.14      0.40
Kk       [kk]   1               0.07      0.20
Pp       [pp]   1               0.07      0.20

2.6. Structure of the test and principle of scoring

The EWINc test for children is identical in principle to the EWIN adult version (Veispak et al., 2015), consisting of 14 lists. The test score is based on the percentage of correctly identified phonemes. Each of the 14 lists consists of one 3-phoneme practice item, which precedes the test items and is not scored, and 9 test items, of which 3 are 3-phoneme and 6 are 4-phoneme words. As the 9 test items in each list consist of 33 phonemes in total, each phoneme equals 3%. Hence, the total percentage score equals the number of correctly identified phonemes multiplied by 3, and 1% is added if the total score is higher than 50%.

Table 2
Average SRTs (dB) and slopes, together with their precision values.
               Average SRT    Precision     Average slope  Precision
Noise, fitted  −8.5 dB SNR    ±0.7 dB       8.6%/dB        ±1.7%/dB
Quiet, fitted  22.0 dB SPL    ±0.7 dB SPL   5.2%/dB        ±0.7%/dB

3. Results

3.1. Norm values and reference psychometric function

Slopes and SRTs at 50% are based on non-linear regression fits to a logistic function (SAS 9.3) using the following Eq. (1):

P = 100 / (1 + e^(4 × s × (SRT − i) / 100))    (1)

where P is the percentage of correct recognition, e is the base of natural logarithms (∼2.72), s is the regression slope, and i is the level of presentation in either dB SPL or dB SNR. The starting values of the parameters for the iterations were defined as s = 8%/dB and SRT = −3 dB SNR for words-in-noise, and s = 8%/dB and SRT = 15 dB SPL for words-in-quiet. The slopes and SRTs at 50% (Table 2) are the arithmetic averages of the individually fitted SRTs and slopes of the different subjects, obtained at fixed levels in quiet and in noise with data aggregated over the lists. The precision (error) bars on both parameters were deduced from the quadratically averaged error bars of the fits to the data of each individual subject. Table 2 shows that the slope at the 50% point is 8.6%/dB for words-in-noise and substantially shallower for words-in-quiet (5.2%/dB). The fact that the items of the test were selected based on noise data, and that the lists were optimized for use in noise, explains the shallower slope in quiet. Psychometric functions of words-in-noise and words-in-quiet based on phoneme scores are presented in Fig. 1. The differences between mean phoneme and word scores at each presentation level, both in quiet and in noise, are shown in Table 3. Across presentation levels, phoneme scoring gave on average 22% higher speech intelligibility scores than whole-word scoring in quiet and 20% higher scores in noise.
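The fit of Eq. (1) was done in SAS, but the same estimation can be reproduced with any non-linear least-squares routine. The sketch below uses scipy's curve_fit with the starting values stated above for the words-in-noise data; the function names are ours. Note that with this parameterization s is, by construction, the slope of the curve in %/dB at the 50% point.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(i, srt, s):
    """Eq. (1): percent correct at presentation level i (dB SNR or dB SPL),
    where srt is the 50%-correct point and s the slope (%/dB) at it."""
    return 100.0 / (1.0 + np.exp(4.0 * s * (srt - i) / 100.0))

def fit_psychometric(levels, scores, srt0=-3.0, s0=8.0):
    """Non-linear least-squares estimate of (SRT, slope) from paired
    (presentation level, percent-correct) data; srt0 and s0 default to
    the starting values used for the words-in-noise fits."""
    (srt, s), _ = curve_fit(logistic, np.asarray(levels, dtype=float),
                            np.asarray(scores, dtype=float), p0=[srt0, s0])
    return srt, s
```

Fitting each subject's level-by-score data separately and then averaging the fitted parameters, as done for Table 2, yields the group SRT and slope along with per-subject error estimates.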
Comparing the SRTs and slopes based on phoneme scoring (Table 2) with the respective values for word scoring (noise: SRT −5.7 dB SNR, slope 9.0%/dB; quiet: SRT 28.1 dB SPL, slope 5.3%/dB) shows that, while the slopes remain approximately the same, the difference in SRTs is 6.1 dB in quiet and 2.8 dB in noise.

3.2. Within-group variability

The performance of young children in identifying words presented in noise as well as in quiet has been shown to be poorer than that of older children and adults (Elliot et al., 1979). As the age of the children in Group 2 (n = 20; average age 8.1, range 7.0–8.5 years) and Group 3 (n = 12; average age 8.6, range 7.2–9.9 years) varied, a simple linear regression was calculated to predict children's performance in noise (Group 2) and in quiet (Group 3) based on age. The results show that age does not